In our final group assignment, we will analyse data of Airbnb listings and fit a model to predict the total cost for two people staying 4 nights in an AirBnB in Berlin. We will download AirBnB data from insideairbnb.com; it was originally scraped from airbnb.com.

vroom will download the *.gz zipped file, unzip, and provide us with a dataframe.

There are many variables in the dataframe. Here is a quick description of some of the variables collected, and you can find a data dictionary here

  • price = cost per night

  • property_type: type of accommodation (House, Apartment, etc.)

  • room_type:

    • Entire home/apt (guests have entire place to themselves)
    • Private room (Guests have private room to sleep, all other rooms shared)
    • Shared room (Guests sleep in room shared with others)
  • number_of_reviews: Total number of reviews for the listing

  • review_scores_rating: Average review score (0 - 100)

  • longitude , latitude: geographical coordinates to help us locate the listing

  • neighbourhood*: three variables on a few major neighbourhoods in each city

1 Exploratory Data Analysis (EDA)

We start by looking at the raw values.

listings %>% 
  glimpse()
Rows: 18,288
Columns: 74
$ id                                           <dbl> 2015, 3176, 7071, 9991, 1~
$ listing_url                                  <chr> "https://www.airbnb.com/r~
$ scrape_id                                    <dbl> 2.021092e+13, 2.021092e+1~
$ last_scraped                                 <date> 2021-09-22, 2021-09-22, ~
$ name                                         <chr> "Berlin-Mitte Value! Quie~
$ description                                  <chr> "Great location!  <br />3~
$ neighborhood_overview                        <chr> "It is located in the for~
$ picture_url                                  <chr> "https://a0.muscache.com/~
$ host_id                                      <dbl> 2217, 3718, 17391, 33852,~
$ host_url                                     <chr> "https://www.airbnb.com/u~
$ host_name                                    <chr> "Ion", "Britta", "BrightR~
$ host_since                                   <date> 2008-08-18, 2008-10-19, ~
$ host_location                                <chr> "Key Biscayne, Florida, U~
$ host_about                                   <chr> "Isn’t sharing economy gr~
$ host_response_time                           <chr> "within an hour", "a few ~
$ host_response_rate                           <chr> "100%", "40%", "100%", "N~
$ host_acceptance_rate                         <chr> "91%", "100%", "N/A", "0%~
$ host_is_superhost                            <lgl> TRUE, FALSE, TRUE, FALSE,~
$ host_thumbnail_url                           <chr> "https://a0.muscache.com/~
$ host_picture_url                             <chr> "https://a0.muscache.com/~
$ host_neighbourhood                           <chr> "Mitte", "Prenzlauer Berg~
$ host_listings_count                          <dbl> 5, 1, 2, 1, 4, 4, 2, 1, 4~
$ host_total_listings_count                    <dbl> 5, 1, 2, 1, 4, 4, 2, 1, 4~
$ host_verifications                           <chr> "['email', 'phone', 'revi~
$ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ host_identity_verified                       <lgl> FALSE, TRUE, TRUE, TRUE, ~
$ neighbourhood                                <chr> "Berlin, Germany", "Berli~
$ neighbourhood_cleansed                       <chr> "Brunnenstr. Süd", "Prenz~
$ neighbourhood_group_cleansed                 <chr> "Mitte", "Pankow", "Panko~
$ latitude                                     <dbl> 52.53305, 52.53471, 52.54~
$ longitude                                    <dbl> 13.40394, 13.41810, 13.41~
$ property_type                                <chr> "Entire guesthouse", "Ent~
$ room_type                                    <chr> "Entire home/apt", "Entir~
$ accommodates                                 <dbl> 2, 4, 2, 7, 1, 5, 2, 4, 4~
$ bathrooms                                    <lgl> NA, NA, NA, NA, NA, NA, N~
$ bathrooms_text                               <chr> "1 bath", "1 bath", "1 sh~
$ bedrooms                                     <dbl> 1, 1, 1, 4, NA, 1, NA, 2,~
$ beds                                         <dbl> 0, 2, 2, 7, 1, 3, 0, 2, 2~
$ amenities                                    <chr> "[\"Refrigerator\", \"Hea~
$ price                                        <chr> "$77.00", "$90.00", "$33.~
$ minimum_nights                               <dbl> 90, 62, 1, 6, 90, 60, 5, ~
$ maximum_nights                               <dbl> 1125, 1125, 10, 14, 1125,~
$ minimum_minimum_nights                       <dbl> 33, 62, 1, 6, 90, 60, 5, ~
$ maximum_minimum_nights                       <dbl> 90, 62, 1, 6, 90, 60, 5, ~
$ minimum_maximum_nights                       <dbl> 1125, 1125, 10, 14, 1125,~
$ maximum_maximum_nights                       <dbl> 1125, 1125, 10, 14, 1125,~
$ minimum_nights_avg_ntm                       <dbl> 88.2, 62.0, 1.0, 6.0, 90.~
$ maximum_nights_avg_ntm                       <dbl> 1125.0, 1125.0, 10.0, 14.~
$ calendar_updated                             <lgl> NA, NA, NA, NA, NA, NA, N~
$ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ availability_30                              <dbl> 0, 9, 0, 0, 0, 0, 0, 3, 0~
$ availability_60                              <dbl> 21, 9, 0, 0, 1, 0, 4, 31,~
$ availability_90                              <dbl> 51, 9, 0, 0, 31, 0, 4, 61~
$ availability_365                             <dbl> 326, 93, 0, 0, 102, 144, ~
$ calendar_last_scraped                        <date> 2021-09-22, 2021-09-22, ~
$ number_of_reviews                            <dbl> 143, 147, 293, 8, 26, 48,~
$ number_of_reviews_ltm                        <dbl> 10, 1, 0, 0, 1, 0, 21, 2,~
$ number_of_reviews_l30d                       <dbl> 1, 0, 0, 0, 0, 0, 3, 0, 0~
$ first_review                                 <date> 2016-04-11, 2010-12-21, ~
$ last_review                                  <date> 2021-07-22, 2017-03-20, ~
$ review_scores_rating                         <dbl> 4.66, 4.63, 4.83, 5.00, 4~
$ review_scores_accuracy                       <dbl> 4.79, 4.68, 4.85, 5.00, 5~
$ review_scores_cleanliness                    <dbl> 4.52, 4.53, 4.90, 5.00, 4~
$ review_scores_checkin                        <dbl> 4.88, 4.64, 4.86, 5.00, 4~
$ review_scores_communication                  <dbl> 4.89, 4.69, 4.85, 5.00, 4~
$ review_scores_location                       <dbl> 4.96, 4.92, 4.91, 4.86, 4~
$ review_scores_value                          <dbl> 4.59, 4.63, 4.71, 4.86, 4~
$ license                                      <chr> NA, NA, NA, "03/Z/RA/0034~
$ instant_bookable                             <lgl> FALSE, FALSE, TRUE, FALSE~
$ calculated_host_listings_count               <dbl> 5, 1, 1, 1, 3, 2, 1, 1, 2~
$ calculated_host_listings_count_entire_homes  <dbl> 5, 1, 0, 1, 3, 2, 1, 1, 2~
$ calculated_host_listings_count_private_rooms <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0~
$ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ reviews_per_month                            <dbl> 2.15, 1.12, 2.40, 0.16, 0~

Our dataset has 18,288 observations and 74 variables. It includes details about Airbnb listings in Berlin.

1.0.1 Computing summary statistics of the variables of interest, or finding NAs

listings %>% 
  skim()
Data summary
Name Piped data
Number of rows 18288
Number of columns 74
_______________________
Column type frequency:
character 25
Date 5
logical 7
numeric 37
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 33 37 0 18288 0
name 29 1.00 1 255 0 17766 0
description 544 0.97 1 1000 0 17156 0
neighborhood_overview 8702 0.52 1 1000 0 8570 0
picture_url 0 1.00 60 126 0 18047 0
host_url 0 1.00 38 43 0 14776 0
host_name 16 1.00 1 35 0 5177 0
host_location 59 1.00 1 199 0 952 0
host_about 9327 0.49 1 5095 0 6642 21
host_response_time 16 1.00 3 18 0 5 0
host_response_rate 16 1.00 2 4 0 66 0
host_acceptance_rate 16 1.00 2 4 0 99 0
host_thumbnail_url 16 1.00 55 106 0 14674 0
host_picture_url 16 1.00 57 109 0 14674 0
host_neighbourhood 6091 0.67 1 28 0 165 0
host_verifications 0 1.00 2 158 0 318 0
neighbourhood 8702 0.52 7 43 0 50 0
neighbourhood_cleansed 0 1.00 4 41 0 137 0
neighbourhood_group_cleansed 0 1.00 5 24 0 12 0
property_type 0 1.00 3 35 0 68 0
room_type 0 1.00 10 15 0 4 0
bathrooms_text 26 1.00 6 17 0 27 0
amenities 0 1.00 2 1416 0 15257 0
price 0 1.00 5 9 0 430 0
license 16019 0.12 3 342 0 1921 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.0 2021-09-21 2021-10-03 2021-09-22 4
host_since 16 1.0 2008-08-08 2021-09-20 2015-09-16 3562
calendar_last_scraped 0 1.0 2021-09-21 2021-10-03 2021-09-22 4
first_review 3572 0.8 2010-12-21 2021-09-22 2018-07-10 2771
last_review 3572 0.8 2012-07-08 2021-09-26 2019-09-28 2226

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 16 1 0.15 FAL: 15448, TRU: 2824
host_has_profile_pic 16 1 0.99 TRU: 18178, FAL: 94
host_identity_verified 16 1 0.79 TRU: 14412, FAL: 3860
bathrooms 18288 0 NaN :
calendar_updated 18288 0 NaN :
has_availability 0 1 0.98 TRU: 17929, FAL: 359
instant_bookable 0 1 0.30 FAL: 12737, TRU: 5551

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.557156e+07 15400105.68 2.015000e+03 1.218794e+07 2.385470e+07 3.968697e+07 5.238006e+07 <U+2587><U+2587><U+2587><U+2586><U+2587>
scrape_id 0 1.00 2.021092e+13 0.00 2.021092e+13 2.021092e+13 2.021092e+13 2.021092e+13 2.021092e+13 <U+2581><U+2581><U+2587><U+2581><U+2581>
host_id 0 1.00 9.337946e+07 108308811.44 1.581000e+03 1.194556e+07 4.352120e+07 1.449065e+08 4.238179e+08 <U+2587><U+2582><U+2581><U+2581><U+2581>
host_listings_count 16 1.00 4.560000e+00 40.36 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 2.010000e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
host_total_listings_count 16 1.00 4.560000e+00 40.36 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 2.010000e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
latitude 0 1.00 5.251000e+01 0.03 5.234000e+01 5.249000e+01 5.251000e+01 5.253000e+01 5.266000e+01 <U+2581><U+2581><U+2587><U+2583><U+2581>
longitude 0 1.00 1.341000e+01 0.06 1.310000e+01 1.337000e+01 1.341000e+01 1.344000e+01 1.376000e+01 <U+2581><U+2582><U+2587><U+2581><U+2581>
accommodates 0 1.00 2.710000e+00 1.62 0.000000e+00 2.000000e+00 2.000000e+00 3.000000e+00 1.600000e+01 <U+2587><U+2582><U+2581><U+2581><U+2581>
bedrooms 1609 0.91 1.270000e+00 0.63 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.200000e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
beds 227 0.99 1.620000e+00 1.24 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 1.700000e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_nights 0 1.00 9.320000e+00 34.24 1.000000e+00 2.000000e+00 3.000000e+00 5.000000e+00 1.124000e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_nights 0 1.00 5.883000e+02 528.30 1.000000e+00 2.800000e+01 3.650000e+02 1.125000e+03 5.000000e+03 <U+2587><U+2587><U+2581><U+2581><U+2581>
minimum_minimum_nights 1 1.00 9.220000e+00 34.15 1.000000e+00 2.000000e+00 3.000000e+00 5.000000e+00 1.124000e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_minimum_nights 1 1.00 9.880000e+00 35.35 1.000000e+00 2.000000e+00 3.000000e+00 5.000000e+00 1.124000e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_maximum_nights 1 1.00 4.704186e+05 31757979.28 1.000000e+00 3.000000e+01 1.125000e+03 1.125000e+03 2.147484e+09 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_maximum_nights 1 1.00 5.878616e+05 35505529.04 1.000000e+00 3.000000e+01 1.125000e+03 1.125000e+03 2.147484e+09 <U+2587><U+2581><U+2581><U+2581><U+2581>
minimum_nights_avg_ntm 1 1.00 9.600000e+00 34.52 1.000000e+00 2.000000e+00 3.000000e+00 5.000000e+00 1.124000e+03 <U+2587><U+2581><U+2581><U+2581><U+2581>
maximum_nights_avg_ntm 1 1.00 5.875923e+05 35489477.79 1.000000e+00 3.000000e+01 1.125000e+03 1.125000e+03 2.147484e+09 <U+2587><U+2581><U+2581><U+2581><U+2581>
availability_30 0 1.00 3.810000e+00 7.72 0.000000e+00 0.000000e+00 0.000000e+00 3.000000e+00 3.000000e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
availability_60 0 1.00 1.019000e+01 17.76 0.000000e+00 0.000000e+00 0.000000e+00 1.500000e+01 6.000000e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
availability_90 0 1.00 1.832000e+01 29.27 0.000000e+00 0.000000e+00 0.000000e+00 3.500000e+01 9.000000e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
availability_365 0 1.00 8.556000e+01 124.51 0.000000e+00 0.000000e+00 0.000000e+00 1.620000e+02 3.650000e+02 <U+2587><U+2581><U+2581><U+2581><U+2582>
number_of_reviews 0 1.00 2.279000e+01 51.02 0.000000e+00 1.000000e+00 4.000000e+00 1.700000e+01 6.550000e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
number_of_reviews_ltm 0 1.00 2.680000e+00 9.36 0.000000e+00 0.000000e+00 0.000000e+00 2.000000e+00 4.470000e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
number_of_reviews_l30d 0 1.00 4.500000e-01 1.65 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.040000e+02 <U+2587><U+2581><U+2581><U+2581><U+2581>
review_scores_rating 3572 0.80 4.630000e+00 0.80 0.000000e+00 4.610000e+00 4.850000e+00 5.000000e+00 5.000000e+00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_accuracy 3897 0.79 4.790000e+00 0.41 0.000000e+00 4.750000e+00 4.920000e+00 5.000000e+00 5.000000e+00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_cleanliness 3895 0.79 4.640000e+00 0.53 0.000000e+00 4.500000e+00 4.800000e+00 5.000000e+00 5.000000e+00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_checkin 3909 0.79 4.830000e+00 0.39 0.000000e+00 4.800000e+00 4.960000e+00 5.000000e+00 5.000000e+00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_communication 3898 0.79 4.830000e+00 0.40 0.000000e+00 4.810000e+00 4.970000e+00 5.000000e+00 5.000000e+00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_location 3908 0.79 4.760000e+00 0.38 0.000000e+00 4.670000e+00 4.880000e+00 5.000000e+00 5.000000e+00 <U+2581><U+2581><U+2581><U+2581><U+2587>
review_scores_value 3910 0.79 4.670000e+00 0.45 0.000000e+00 4.550000e+00 4.760000e+00 5.000000e+00 5.000000e+00 <U+2581><U+2581><U+2581><U+2581><U+2587>
calculated_host_listings_count 0 1.00 3.030000e+00 7.45 1.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 7.600000e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
calculated_host_listings_count_entire_homes 0 1.00 1.940000e+00 5.42 0.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00 4.400000e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
calculated_host_listings_count_private_rooms 0 1.00 8.900000e-01 3.25 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 4.500000e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
calculated_host_listings_count_shared_rooms 0 1.00 1.400000e-01 2.02 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.800000e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>
reviews_per_month 3572 0.80 8.200000e-01 1.58 1.000000e-02 9.000000e-02 3.000000e-01 1.000000e+00 9.086000e+01 <U+2587><U+2581><U+2581><U+2581><U+2581>

Using the skim function, we can see that we have 25 character variables, 5 date variables, 7 logical variables and 37 numeric variables. While many variables have no missing values, some variables like ‘host_about’, ‘host_neighbourhood’, ‘neighbourhood’ and ‘license’ have many missing observations. In fact, ‘bathrooms’ and ‘calendar_update’ have no observations at all.

We will now list the variable names by type:

1.0.2 Numeric

listings %>% 
  skim() %>% 
  filter(skim_type=="numeric") %>% 
  select(skim_variable)

skim_variable
id
scrape_id
host_id
host_listings_count
host_total_listings_count
latitude
longitude
accommodates
bedrooms
beds
minimum_nights
maximum_nights
minimum_minimum_nights
maximum_minimum_nights
minimum_maximum_nights
maximum_maximum_nights
minimum_nights_avg_ntm
maximum_nights_avg_ntm
availability_30
availability_60
availability_90
availability_365
number_of_reviews
number_of_reviews_ltm
number_of_reviews_l30d
review_scores_rating
review_scores_accuracy
review_scores_cleanliness
review_scores_checkin
review_scores_communication
review_scores_location
review_scores_value
calculated_host_listings_count
calculated_host_listings_count_entire_homes
calculated_host_listings_count_private_rooms
calculated_host_listings_count_shared_rooms
reviews_per_month
### Factor

listings %>% 
  skim() %>% 
  filter(skim_type=="logical") %>% 
  select(skim_variable)

skim_variable
host_is_superhost
host_has_profile_pic
host_identity_verified
bathrooms
calendar_updated
has_availability
instant_bookable
### Character

listings %>% 
  skim() %>% 
  filter(skim_type=="character") %>% 
  select(skim_variable)

skim_variable
listing_url
name
description
neighborhood_overview
picture_url
host_url
host_name
host_location
host_about
host_response_time
host_response_rate
host_acceptance_rate
host_thumbnail_url
host_picture_url
host_neighbourhood
host_verifications
neighbourhood
neighbourhood_cleansed
neighbourhood_group_cleansed
property_type
room_type
bathrooms_text
amenities
price
license
Notice that the variables host_response_rate, host_acceptance_rate, and price have been categorized as character strings instead of numeric variables.

Since price is a quantitative variable, we need to make sure it is stored as numeric data num in the dataframe. To do so, we will first use readr::parse_number() which drops any non-numeric characters before or after the first number. We will also convert host_response_rate and host_acceptance_rate.

listings <- listings %>% 
  mutate(price = parse_number(price),
         host_response_rate = parse_number(host_response_rate)/100,
         host_acceptance_rate = parse_number(host_acceptance_rate)/100
         )
# We check the new type of the three variables ...
typeof(listings$price)
[1] "double"
typeof(listings$host_response_rate)
[1] "double"
typeof(listings$host_acceptance_rate) #Price, host response rate and host acceptance rate are now stored as numbers for the training set
[1] "double"
# ... and take a closer look at the variables
listings %>% 
  select(price, host_response_rate, host_acceptance_rate) %>% 
  glimpse()
Rows: 18,288
Columns: 3
$ price                <dbl> 77, 90, 33, 180, 70, 90, 49, 169, 70, 65, 93, 65,~
$ host_response_rate   <dbl> 1.00, 0.40, 1.00, NA, 1.00, 1.00, 1.00, NA, 1.00,~
$ host_acceptance_rate <dbl> 0.91, 1.00, NA, 0.00, 0.56, NA, 0.98, NA, NA, 0.7~

1.1 The Variables & Data Wrangling

Now we will look at the variables in more detail and construct our dependent variable of interest: price per 4 nights for 2 people.

1.2 Minimum Nights

Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. However, some properties might be intended for other purposes. We only want to include listings in our regression analysis that are intended for travel purposes. In order to do this, we need to look at minimum_nights in more detail:

# First, we want to see what are the most common values for minimum_nights
listings %>%
  count(minimum_nights) %>%
  arrange(desc(n))
minimum_nightsn
2       4236
1       4194
3       3282
4       1368
5       1293
7       864
30       418
6       382
14       363
60       298
10       284
90       195
20       117
28       95
15       82
8       75
21       72
180       53
12       43
9       39
25       37
13       33
29       32
61       31
62       28
22       26
120       23
31       17
183       16
45       14
93       13
18       11
150       11
16       10
89       10
91       10
40       9
58       9
100       9
357       9
11       7
19       7
50       7
56       7
365       7
23       6
27       6
181       6
65       5
92       5
200       5
1e+03       5
17       4
55       4
63       4
70       4
80       4
85       4
300       4
24       3
42       3
59       3
99       3
118       3
182       3
186       3
500       3
1.12e+033
26       2
33       2
35       2
83       2
84       2
140       2
185       2
240       2
360       2
34       1
37       1
48       1
49       1
51       1
71       1
75       1
82       1
87       1
88       1
98       1
101       1
105       1
119       1
122       1
125       1
128       1
129       1
170       1
179       1
184       1
187       1
188       1
210       1
250       1
270       1
304       1
355       1
356       1
720       1
1.1e+03 1
# We also look at the range of values for minimum_nights 
favstats(~minimum_nights, data = listings)
minQ1medianQ3maxmeansdnmissing
12351.12e+039.3234.2182880

As we predicted, most properties on Airbnb seem to be meant for travelers. The most common values are 2,1 and 3 nights, which would be reasonable for a tourist. We can also see, however, that 30 and 60 day minimum stay are among the 10 most common type of properties. We can guess these properties are meant as a substitute to renting from an agent or a landlord. In other words, these properties are meant for long-term stay and are not appropriate for our analysis. Additionally, we can see that the median minimum stay requirement is 3 nights, where as the mean is around 9. The discrepancy comes from the large outliers we see in the data. For example, one property has a minimum stay of 1,124 nights, which heavily skews our data.

To deal with this, we will filter the data to include only properties that have a minimum stay of 4 nights or less.

listings <- listings %>% 
  filter(minimum_nights <= 4)

favstats(~minimum_nights, data = listings)
minQ1medianQ3maxmeansdnmissing
112342.140.985130800
listings %>% 
  ggplot(aes(x = minimum_nights)) +
  geom_histogram(fill = "light blue",
                 colour = "grey") +
  labs(title = "Distribution of Minimum Stay Requirements",
       x = "Minimum Nights Stay Requirement",
       y = "Count",
       caption="Data from: http://insideairbnb.com/get-the-data.html") +
  theme_bw()

1.2.1 Maximum nights

So far we have filtered the data to include only the listings, which offer a stay with a minimum length of 1-4 nights. However, it is possible that these include apartments which offer maximum stays of 1-3 days. We need to remove these listings and keeping only those that have maximum stays of 4 nights or more.

listings %>% 
  filter(maximum_nights<=3) %>% 
  count() 
n
256
listings <- listings %>% 
  filter(maximum_nights >= 4)

favstats(~maximum_nights, data = listings)
minQ1medianQ3maxmeansdnmissing
4251.12e+031.12e+035e+03621532128240
listings %>% 
  ggplot(aes(x = maximum_nights)) +
  geom_histogram(fill = "light blue",
                 colour = "grey") +
  labs(title = "Distribution of Maximum Stay Requirements",
       x = "Maximum Nights Stay Requirement",
       y = "Count",
       caption="Data from: http://insideairbnb.com/get-the-data.html") +
  theme_bw()

We have now removed the 256 listings that had a stay requirement that was too short.

1.2.2 Accommodates

Since we are considering apartments for 2, we also need to filter the data to include only apartments which accommodate at least 2 people. First let’s look at summary statistics of the accommodates variable.

favstats(~accommodates, data=listings)

minQ1medianQ3maxmeansdnmissing
0223162.771.71128240
Strangely, there is an apartment in the dataset which claims to accommodate no one. We will leave this issue for now as we will filter out apartments to include only those that accommodate 2 or more people later.

There is also an apartment in the dataset which claims to accommodate 16 people. It’s possible that this is an outlier so we will plot a histogram and boxplot to determine whether this is the case.

listings %>% 
  ggplot(aes(x = accommodates)) +
  geom_histogram(fill = "light blue",
                 color = "grey") +
  labs(title = "Distribution of people accommodated per Airbnb Apartment in Berlin",
       subtitle = "Data includes only apartments that accommodate at least 2 people",
       x = "Number of people accommodated",
       y = "Count",
       caption="Data from: http://insideairbnb.com/get-the-data.html") +
  theme_bw()

The distribution has a positive skew.

listings %>% 
  ggplot(aes(x = accommodates)) +
  geom_boxplot(fill="light blue") +
  labs(title = "Distribution of people accommodated per Airbnb Apartment in Berlin",
       x = "Number of people accommodated",
       caption="Data from: http://insideairbnb.com/get-the-data.html") +
  theme_bw()

The boxplot shows that there are several outliers. Let’s check how many points are classified as outliers and remove some of them.

accommodates_outliers<-boxplot(listings$accommodates)$out
length(accommodates_outliers) #There are a fair number of 'outliers'. Let's look at these more closely.
[1] 1318
accommodates_outliers %>% table() #There are a fairly large number of apartments which accommodate 8 people. 
.
  0   5   6   7   8   9  10  11  12  13  14  15  16 
  6 419 501  93 129  34  52  11  28  10  13   5  17 
#We will only discard apartments which accommodate 9 or more people.
  
listings<-listings %>% 
  mutate(accommodates = if_else(accommodates>=9,NA_real_,accommodates)
  )

Since we are interested only in apartments for 2 people, we keep only the listings that accommodate at least 2 people.

listings<-listings %>% 
  filter(accommodates>=2)

1.2.3 Price

Now that we have filtered the data, let’s look at our dependent variable price in more detail:

favstats(~price, data = listings)
minQ1medianQ3maxmeansdnmissing
03960908e+0378.8118112510

We can see that none of the price observations are missing, however the minimum price per night is 0, which seems suspicious. We will investigate this further and see how many observations have a price of 0.

listings %>% 
  filter(price==0) %>% 
  count()

n
2
There are 2 observations where price takes the value 0.

Next we are going to create a new variable price_4_nights, which represents the price we would have to pay to stay at an Airbnb property for 4 nights. We also exclude any properties with negative or 0 prices.

listings <- listings %>% 
  filter(price > 0) %>%
  mutate(price_4_nights = 4*price)

It is important to know the distribution of our dependent variable. To check this, we will plot a histogram of price_4_nights with the code below:

listings %>% 
  ggplot(aes(x = price_4_nights)) +
  geom_histogram(fill = "light blue", colour = "grey") +
  labs(title = "Distribution of Prices of Airbnb Apartment in Berlin for 4 nights",
       subtitle = "Data includes only apartments that accommodate at least 2 people",
       x = "Price per 4 nights",
       y="Count",
       caption="Data from: http://insideairbnb.com/get-the-data.html") +
  theme_bw()

As we can see, the data on prices is heavily right-skewed. This can hurt the interpretative power of our model and we would need to address it before continuing with our regression analysis.

listings %>% 
  ggplot(aes(x = price_4_nights)) +
  geom_boxplot(color="light blue") +
  labs(title = "Distribution of Prices of Airbnb Apartment in Berlin for 4 nights",
       subtitle = "Data includes only apartments that accommodate at least 2 people",
       caption="Data from: http://insideairbnb.com/get-the-data.html",
       x = "Price per 4 Nights") +
  theme_bw()

Additionally, we can see from the boxplot that there are some outliers. This will be addressed below.

To deal with the distribution of prices, we will transform the price_4_nights using a log function.

listings <- listings %>% 
  mutate(log_price_4_nights = log(price_4_nights))

listings %>% 
  ggplot(aes(x = log_price_4_nights)) +
  geom_histogram(fill = "light blue",
                colour = "grey") +
  geom_vline(aes(xintercept = median(log_price_4_nights)),
            color ="black",
            linetype ="dashed",
            size = 1)+
  labs(title = "Distribution of Log Prices of Airbnb Apartment in Berlin for 4 nights",
       subtitle = "Data includes only apartments that accomodate at least 2 people",
       x = "Log of Price per 4 nights",
       y = "",
       caption = "Data from: http://insideairbnb.com/get-the-data.html") + 
  theme_bw()

listings %>% 
  ggplot(aes(x = log_price_4_nights)) +
  geom_density(fill = "light blue") +
  geom_vline(aes(xintercept = median(log_price_4_nights)),
            color = "black",
            linetype="dashed", 
            size=1) +
  labs(title = "Distribution of Log Prices of Airbnb Apartment in Berlin for 4 nights",
       subtitle = "Data includes only apartments that accomodate at least 2 people",
       x = "Log of Price per 4 nights",
       y = "",
       caption="Data from: http://insideairbnb.com/get-the-data.html") +
  theme_bw()

As we can see, applying a log transformation made the distribution of prices much more like a normal distribution, although there is still a slight right-skew.

listings %>% 
  ggplot(aes(x = log_price_4_nights)) +
  geom_boxplot(fill="light blue") +
  labs(title = "Distribution of Log Prices of Airbnb Apartment in Berlin for 4 nights",
       subtitle = "Data includes only apartments that accommodate at least 2 people",
       x="Log of Price per 4 Nights",
       caption="Data from: http://insideairbnb.com/get-the-data.html") + 
  theme_bw()

Once we log transform the price variable, there are still several outliers. We will identify these and remove them from our dataset.

logprice_outliers <- boxplot(listings$log_price_4_nights)$out
listings<-listings %>% 
  mutate(log_price_4_nights = if_else(log_price_4_nights %in% logprice_outliers,
                                             NA_real_,
                                              log_price_4_nights
                                    )
  )

1.2.4 Property types

Next, we look at the variable property_type and use the count function to determine how many categories there are, their frequency and their proportions.

listings %>% 
  count(property_type, sort=TRUE) %>% 
  count() #There are 57 categories.
n
57
property_counts <- listings %>% 
  count(property_type, sort=TRUE) %>%
  mutate(sum_n=sum(n)) %>% 
  slice_max(order_by=n, n=4) %>% 
  mutate(prop=n/sum_n)

property_counts
property_typensum_nprop
Entire rental unit5314112490.472 
Private room in rental unit4032112490.358 
Entire condominium (condo)283112490.0252
Entire serviced apartment227112490.0202
property_counts %>% 
  summarise(total_prop = sum(prop))

total_prop
0.876
We can see that there are 57 different property types. The top 4 most common property types are entire rental units (47.2%), private rooms in rental units (35.8%), entire condos (2.5%) & entire serviced apartments (2.0%), which totals 87.6% of the data.

Since the vast majority of the observations in the data are one of the top four or five property types, we create a simplified version of property_type variable that has 5 categories: the top four categories and Other.

listings <- listings %>%
            mutate(prop_type_simplified = case_when(
            property_type %in% c("Entire rental unit",
                                 "Private room in rental unit", 
                                 "Entire condominium (condo)",
                                 "Entire serviced apartment") ~ property_type, 
                                  TRUE ~ "Other"
                                          )
            )

We use the code below to check that prop_type_simplified was correctly made.

listings %>%
  count(property_type, prop_type_simplified) %>%
  arrange(desc(n))
property_typeprop_type_simplifiedn
Entire rental unitEntire rental unit5314
Private room in rental unitPrivate room in rental unit4032
Entire condominium (condo)Entire condominium (condo)283
Entire serviced apartmentEntire serviced apartment227
Entire loftOther201
Private room in residential homeOther160
Room in hotelOther134
Private room in condominium (condo)Other130
Entire residential homeOther104
Room in boutique hotelOther78
Private room in bed and breakfastOther54
Private room in loftOther54
Entire guesthouseOther49
Shared room in hostelOther49
Private room in townhouseOther43
Shared room in rental unitOther43
Private room in hostelOther30
Room in aparthotelOther29
Room in serviced apartmentOther28
Entire guest suiteOther23
Private room in serviced apartmentOther19
HouseboatOther16
Entire bungalowOther15
Private roomOther14
Private room in pensionOther11
Entire townhouseOther10
Private room in guesthouseOther9
BoatOther8
Tiny houseOther8
Camper/RVOther7
Private room in guest suiteOther7
Private room in casa particularOther5
Private room in villaOther5
Room in hostelOther5
Entire cottageOther4
Entire placeOther4
Private room in boatOther4
Shared room in boutique hotelOther4
Entire cabinOther3
Entire villaOther3
Shared room in condominium (condo)Other3
Private room in bungalowOther2
Private room in cottageOther2
Room in bed and breakfastOther2
TreehouseOther2
CastleOther1
Entire chaletOther1
Entire home/aptOther1
Private room in caveOther1
Private room in floorOther1
Private room in houseboatOther1
Private room in tiny houseOther1
Private room in tipiOther1
Shared room in boatOther1
Shared room in loftOther1
Shared room in townhouseOther1
Shipping containerOther1

We also make a bar plot.

listings %>% 
  ggplot(aes(y=fct_rev(fct_infreq(prop_type_simplified)))) +
  geom_bar(fill = "light blue")+
  labs(
    title="Bar Plot of Simplified Property Type",
    y="Property Type (Simplified)",
    x="Count"
  )+
  NULL

Now, we can see that the data is much more evenly distributed.

1.2.5 Bathrooms

The bathrooms data is all missing, so we will construct a bathrooms variable using bathrooms_text.

listings<-listings %>% 
  mutate(bathrooms_num=parse_number(bathrooms_text))

We will now look at the distribution of the bathrooms_num variable.

favstats(~bathrooms_num, data=listings)
minQ1medianQ3maxmeansdnmissing
011181.10.3111120247
listings %>% 
  ggplot(aes(x = bathrooms_num)) +
  geom_histogram(fill = "light blue",
                 colour = "grey") +
  labs(title = "Distribution of Bathrooms of Airbnb Apartments in Berlin",
       subtitle = "Data includes only apartments that accommodate at least 2 people",
       x = "Number of bathrooms",
       y = "Count",
       caption="Data from: http://insideairbnb.com/get-the-data.html") + 
  theme_bw()

listings %>% 
  filter(bathrooms_num==0) %>% 
  count()
n
42

Surprisingly, some apartments have no bathrooms. Perhaps Airbnb classifies toilets and bathrooms separately, or these apartments have access to public toilets in the vicinity.

The data also looks right-skewed. We will make a boxplot to determine whether these extreme points are outliers.

listings %>% 
  ggplot(aes(x = bathrooms_num)) +
  geom_boxplot(color="light blue") +
  labs(title = "Boxplot of Bathrooms of Airbnb Apartments in Berlin",
       subtitle = "Data includes only apartments that accommodate at least 2 people",
       x = "Number of bathrooms",
       caption="Data from: http://insideairbnb.com/get-the-data.html") + 
  theme_bw()

bathrooms_outliers<-boxplot(listings$bathrooms_num)$out
length(bathrooms_outliers) #There are a large number of 'outliers'. Let's look at these more closely.
[1] 1465
bathrooms_outliers %>% table() #Some of these listings have been classified as outliers despite having a reasonable number of bathrooms. However it is unlikely that 2 
.
  0 1.5   2 2.5   3 3.5   4 4.5   8 
 42 739 578  77  18   3   4   3   1 
#We will only discard apartments with 11 or more beds.
  
listings<-listings %>% 
  mutate(beds = if_else(beds>=11,NA_real_,beds)
  )

Some of these listings have been classified as outliers despite having a reasonable number of bathrooms. However, 8 bathrooms is questionable. It is unlikely that 2 people would require an 8 bathroom apartment for a 4-night stay, so for our analysis we will remove this entry.

listings<-listings %>% 
  mutate(bathrooms_num = if_else(bathrooms_num>=8,NA_real_,bathrooms_num)
  )

1.2.6 Bed

We will now look at the distribution of the beds variable.

favstats(~beds, data=listings)
minQ1medianQ3maxmeansdnmissing
0112101.631.1311119130
listings %>% 
  ggplot(aes(x = beds)) +
  geom_histogram(fill = "light blue") +
  labs(title = "Distribution of Beds of Airbnb Apartments in Berlin",
       subtitle = "Data includes only apartments that accommodate at least 2 people",
       x = "Number of beds",
       y = "Count",
       caption="Data from: http://insideairbnb.com/get-the-data.html") +
  theme_bw()

The data is right-skewed. There is even an apartment in the dataset which claim to have 10 beds. While this is possible, they may also be outliers. We will look at the boxplot.

listings %>% 
  ggplot(aes(x = beds)) +
  geom_boxplot(fill="light blue") +
  labs(title = "Boxplot of Beds of Airbnb Apartments in Berlin",
       subtitle = "Data includes only apartments that accommodate at least 2 people",
       x = "Number of beds",
       caption="Data from: http://insideairbnb.com/get-the-data.html") +
  theme_bw()

The boxplot confirms that the data is right-skewed. Next we check for and remove some outliers.

beds_outliers<-boxplot(listings$beds)$out
length(beds_outliers) #There are a large number of 'outliers'. Let's look at these more closely.
[1] 768
beds_outliers %>% table() #There are still a large number of apartments with 4-8 beds, so we will not discard these.
.
  4   5   6   7   8   9  10 
429 153 130  26  28   1   1 
#We will only discard apartments with 9 or more beds.
  
listings<-listings %>% 
  mutate(beds = if_else(beds>=9,NA_real_,beds)
  )

1.2.7 Bedrooms

We now perform the same analysis on the bedrooms variable.

favstats(~bedrooms, data=listings)
minQ1medianQ3maxmeansdnmissing
1111101.240.55410346903
listings %>% 
  ggplot(aes(x = bedrooms)) +
  geom_histogram(fill = "light blue") +
  labs(title = "Distribution of Bedrooms of Airbnb Apartments in Berlin",
       subtitle = "Data includes only apartments that accommodate at least 2 people",
       x = "Number of bedrooms",
       y = "Count",
       caption="Data from: http://insideairbnb.com/get-the-data.html") +
  theme_bw()

Once again, the data is heavily right-skewed.

listings %>% 
  ggplot(aes(x = bedrooms)) +
  geom_boxplot(fill = "light blue") +
  labs(title = "Boxplot of Bedrooms of Airbnb Apartments in Berlin",
       subtitle = "Data includes only apartments that accommodate at least 2 people",
       x = "Number of bedrooms",
       caption="Data from: http://insideairbnb.com/get-the-data.html") +
  theme_bw()

This skewness is confirmed in the boxplot.

bedrooms_outliers<-boxplot(listings$bedrooms)$out
length(bedrooms_outliers) #There are a large number of 'outliers'. Let's look at these more closely.
[1] 2019
bedrooms_outliers %>% table() #Even apartments with two bedrooms have been categorised as outliers, despite there being a large number in the dataset.
.
   2    3    4    5    6   10 
1601  351   57    4    5    1 
#We will only discard apartments with 10 or more bedrooms.
  
listings<-listings %>% 
  mutate(bedrooms = if_else(bedrooms>=10,NA_real_,bedrooms)
  )

1.2.8 Number of reviews

Next, we will look at the variable number_of_reviews in more detail:

listings %>% 
  ggplot(aes(x = number_of_reviews)) +
  geom_density(fill = "light blue") +
  labs(title = "Distribution of Reviews of Airbnb Apartments in Berlin",
       subtitle = "Data includes only apartments that accommodate at least 2 people",
       x = "Number of reviews",
       caption="Data from: http://insideairbnb.com/get-the-data.html") +
  theme_bw()

The distribution is heavily right-skewed.

Next we will look at the outliers.

listings %>% 
  ggplot(aes(x = number_of_reviews)) +
  geom_boxplot(fill = "light blue") +
  labs(title = "Boxplot of Reviews of Airbnb Apartments in Berlin",
       subtitle = "Data includes only apartments that accommodate at least 2 people",
       x = "Number of reviews",
       caption="Data from: http://insideairbnb.com/get-the-data.html") + 
  theme_bw()

numreviews_outliers<-boxplot(listings$number_of_reviews)$out
length(numreviews_outliers)
[1] 1567
#The boxplot indicates that there are a huge number of outliers.

The boxplot indicates that there are a huge number of technical outliers. We will not remove these. Perhaps some apartments have more reviews than others because they are in extremely popular locations, or perhaps they are larger apartments and each guest filled out a review.

1.2.9 Reviews per month

Similarly, we will also consider the number of reviews per month.

listings %>% 
  ggplot(aes(x = reviews_per_month)) +
  geom_density(fill = "light blue") +
  labs(title = "Distribution of Reviews per month of Airbnb Apartments in Berlin",
       subtitle = "Data includes only apartments that accommodate at least 2 people",
       x = "Reviews per month",
       y = "Density",
       caption="Data from: http://insideairbnb.com/get-the-data.html") +
  theme_bw()

As expected, this follows a similar shape to number_of_reviews.

listings %>% 
  ggplot(aes(x = reviews_per_month)) +
  geom_boxplot(fill = "light blue") +
  labs(title = "Boxplot of Reviews of Airbnb Apartments in Berlin",
       subtitle = "Data includes only apartments that accommodate at least 2 people",
       x = "Reviews per month",
       caption="Data from: http://insideairbnb.com/get-the-data.html") +
  theme_bw()

numreviews_outliers<-boxplot(listings$reviews_per_month)$out
length(numreviews_outliers)
[1] 767
#The boxplot indicates that there are a huge number of outliers.

Once again, there are a huge number of technical outliers, which we will not remove.

1.2.10 Review Scores

There are several measures of review scores in the dataset. To compare these, we start by pivoting the data into long format.

reviews_long<-listings %>% 
              select(contains("review_score")) %>% 
              pivot_longer(
                cols=1:ncol(.),
                names_to="review_type",
                values_to="review_scores")

Next we look at summary statistics of the variables

favstats(review_scores~review_type, data=reviews_long)

review_typeminQ1medianQ3maxmeansdnmissing
review_scores_accuracy04.754.91554.790.39 94181831
review_scores_checkin04.8 4.95554.830.37394141835
review_scores_cleanliness04.5 4.8 554.640.50394211828
review_scores_communication04.8 4.96554.830.37294201829
review_scores_location04.674.86554.760.36494141835
review_scores_rating04.614.83554.650.71495761673
review_scores_value04.554.75554.670.42694121837
The variables have quite similar means, however some variables such as review_scores_rating and reviews_scores_cleanliness have higher standard deviations.

Next, we plot these variables and compare distributions.

reviews_long %>% 
  ggplot(aes(x = review_scores, fill=review_type)) +
  geom_density() +
  facet_wrap(~review_type, ncol=3)+
  labs(title = "Distribution of Review Scores of Airbnb Apartments in Berlin",
       subtitle = "Data includes only apartments that accommodate at least 2 people",
       x = "Review Scores",
       y = "Density",
       caption="Data from: http://insideairbnb.com/get-the-data.html") +
  theme_bw() +
  theme(legend.position="none")

At first glance, the variables look fairly similar, with most reviews falling in the range 4-5.

listings %>% 
  select(contains("review_score")) %>%
  ggpairs()

The correlation plot further confirms that there is some degree of correlation between the variables.

1.2.11 Neighbourhoods

Next, we want to explore the 3 variables related to neighborhoods neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed to understand which variable describes better the actual neighbourhoods in Berlin.

listings %>% 
  select("neighbourhood","neighbourhood_cleansed","neighbourhood_group_cleansed") %>% 
  head()

neighbourhoodneighbourhood_cleansedneighbourhood_group_cleansed
Berlin, GermanyHelmholtzplatzPankow
Berlin, GermanyPrenzlauer Berg SüdwestPankow
Berlin, GermanyBrunnenstr. SüdMitte
Berlin, GermanyHelmholtzplatzPankow
Moabit OstMitte
Berlin, GermanyBrunnenstr. SüdMitte
The ‘neighbourhood_group_cleansed’ variable is the most appropriate to evaluate further, as it groups the correct unique neighbourhoods in Berlin.

Next, we want to see how many different neighborhoods we have in the dataset and how frequently they appear. We will be looking at the neighbourhood_group_cleansed variable.

listings %>%
  count(neighbourhood_group_cleansed, sort=TRUE) %>% 
  count() #There are 12 categories.
n
12
neighbourhood_counts <- listings %>% 
  count(neighbourhood_group_cleansed, sort=TRUE) %>%
  mutate(sum_n=sum(n)) %>% 
  slice_max(order_by=n, n=6) %>% 
  mutate(prop=n/sum_n)

neighbourhood_counts
neighbourhood_group_cleansednsum_nprop
Friedrichshain-Kreuzberg2611112490.232 
Mitte2427112490.216 
Pankow1804112490.16  
Neukölln1549112490.138 
Charlottenburg-Wilm.860112490.0765
Tempelhof - Schöneberg698112490.062 
neighbourhood_counts %>% 
  summarise(total_prop = sum(prop))
total_prop
0.884

We have a total of 12 neighbourhoods, however 6 of them represent 88.4% of all the observations. From our knowledge of Berlin, these 6 neighbourhoods are indeed the most popular ones while the others are more on the outskirts of the city, therefore, we will group the other 6 ones into the “Other” category.

listings <- listings %>%
    mutate(neighbourhood_simplified = case_when(
    neighbourhood_group_cleansed %in% c("Friedrichshain-Kreuzberg",
                         "Mitte", 
                         "Pankow",
                         "Neukölln",
                         "Charlottenburg-Wilm.",
                         "Tempelhof - Schöneberg") ~ neighbourhood_group_cleansed, 
                        TRUE ~ "Other"
  ))

We can check our output below.

listings %>%
  count(neighbourhood_group_cleansed, neighbourhood_simplified) %>%
  arrange(desc(n))

neighbourhood_group_cleansedneighbourhood_simplifiedn
Friedrichshain-KreuzbergFriedrichshain-Kreuzberg2611
MitteMitte2427
PankowPankow1804
NeuköllnNeukölln1549
Charlottenburg-Wilm.Charlottenburg-Wilm.860
Tempelhof - SchönebergTempelhof - Schöneberg698
Treptow - KöpenickOther387
LichtenbergOther379
Steglitz - ZehlendorfOther217
ReinickendorfOther155
Marzahn - HellersdorfOther86
SpandauOther76
Now let’s look at this distribution properties in neighbourhoods visually.

listings %>% 
  ggplot(aes(y = fct_rev(fct_infreq(neighbourhood_simplified)))) +
  geom_bar(fill = "light blue") +
  labs(title = "Distribution of Airbnb Apartments in Berlinof Neighbourhoods ",
       subtitle = "Data includes only apartments that accommodate at least 2 people",
       y = "Neighbourhood",
       x = "Count",
       caption="Data from: http://insideairbnb.com/get-the-data.html") + 
  theme_bw()

### Superhost

In our analysis, we will also consider whether being a host_is_superhost is an important factor in determining the price of a 4-night stay. Let’s look at the proportions of listings which fall into each category.

listings %>% 
  count(host_is_superhost) %>% 
  mutate(total=sum(n),
         prop=n/total)

host_is_superhostntotalprop
FALSE9280112490.825  
TRUE1957112490.174  
12112490.00107
Only about 17% of the listings are made by super hosts.

1.2.12 Availability 30

Finally, we consider the availability_30 variable. We expect this variable to range from 0-30, corresponding to the number of the days the apartment is available in the next 30 days. This variable may be a good predictor of price as it indicates the relative demand for an apartment - rooms which have low availability over the next 30 days are likely to have higher demand and can therefore charge higher prices.

favstats(~availability_30, data=listings)

minQ1medianQ3maxmeansdnmissing
0004304.027.72112490
An examination of the summary statistics shows that the minimum is 0 and the maximum is 30, as we would expect. Interestingly, the mean is approximately 4, indicating that apartments are typically booked out for the majority of the next 30 day period.

We also consider the distribution of availability.

listings %>% 
  ggplot(aes(x = availability_30)) +
  geom_density(fill="light blue") +
  labs(title = "Distribution of 30-Day Availability of Airbnb Apartments in Berlin",
       subtitle = "Data includes only apartments that accommodate at least 2 people",
       x = "Availability in next 30 days",
       y="Density",
       caption="Data from: http://insideairbnb.com/get-the-data.html") +
  theme_bw()

The distribution has a strong positive skew.

1.3 Correlation of selected continous variables with “log_price_4_nights”

Next, we will look at the correlations of continuous variables. As we want to evaluate multiple variables, first we will look at those not related to the reviews.

listings %>% 
  ggpairs(columns = c(
                      "log_price_4_nights",
                      "bedrooms", 
                      "beds",
                      "accommodates",
                      "host_response_rate",
                      "host_listings_count",
                      "availability_30",
                      "bathrooms_num",
                      "calculated_host_listings_count"),
          upper = list(continuous = "cor"),
          lower = list(continuous = "points"))

We observe that “log_price_4_nights” has the highest positive correlation with the variables: “bedrooms” (0.428), “beds” (0.374), “accommodate” (0.506), “availability_30” (0.354), “bathrooms_num” (0.137), and “calculated_host_listings_count” (0.207).

In the next step, we will evaluate correlations between the “log_price_4_nights” and reviews-related numeric variables:

listings %>% 
  ggpairs(columns = c(
                      "log_price_4_nights",
                      "number_of_reviews",
                      "review_scores_rating",
                      "number_of_reviews_l30d",
                      "reviews_per_month",
                      "review_scores_value",
                      "review_scores_cleanliness",
                      "review_scores_location",
                      "review_scores_checkin",
                      "review_scores_accuracy"),
           upper = list(continuous = "cor"),
          lower = list(continuous = "points"))

We observe that log_price_4_nights has the highest correlation with the variables: number_of_reviews (0.094), review_scores_location (0.078), number_of_reviews_l30d (0.089), reviews_per_month (0.141), review_scores_value (-0.100). However, compared to the numerical variables evaluated above,reviews-related variables seem to have lower correlations with the price.

1.4 Correlation of selected categorical variables with “log_price_4_nights”

Next, we will separately evaluate correlations of categorical variables.

listings %>% 
  ggpairs(columns = c("log_price_4_nights",
                      "prop_type_simplified",
                      "room_type",
                      "neighbourhood_simplified"
                      ))

Looking at the boxplots, we notice that that property_type_simplified and room_type seem to have the highest correlation with the “log_price_4_nights”.

2 Mapping

Visualizations of feature distributions and their relations are key to understanding a data set, and they can open up new lines of exploration. We will use the following code to show a map of your city, and overlay all Airbnb coordinates to get an overview of the spatial distribution of properties. For this visualization, we use the leaflet package, which includes a variety of tools for interactive maps and allows for easy zoom in-out, clicking on a point to get the actual listing, etc.

leaflet(data = listings) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillColor = "light blue", 
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type)

3 Regression Analysis

For the target variable \(Y\), we will use the cost for two people to stay at an Airbnb location for four (4) nights.

We have already created a new variable called price_4_nights that uses price, and accomodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.

3.1 Creating a test set

First, we need to divide our data into a test and training subset. This will allows us to test the predictive power of our model using an out-of-sample test.

#splitting data into training and test sets

set.seed(1234) #setting a random seed to always get the same split 

train_test_split <- initial_split(listings, prop = 0.80) #taking an 80-20 approach (for all the consultants out there)
listings_train <- training(train_test_split)
listings_test <- testing(train_test_split)

Now, our test dataset has 2250 observations and training data has 8999 observations with 79 variables each.

We will set our test data aside and do our analysis using only the training data

# Model 1 - Property Type, Number of Reviews, Review Scores
model1 <- lm(log_price_4_nights ~ prop_type_simplified + 
                                  number_of_reviews + 
                                  review_scores_rating, 
             data = listings_train)

# Since prop_type_simplified is a categorical variable with 5 possible options, the model only shows 4 coefficients. The fifth one (Entire Condo), is incorporated in the intercept

# We construct a table with the regression output
huxtable:: huxreg(list("Model1" = model1), 
                  statistics = c('#observations' = 'nobs', 
                                'R squared' = 'r.squared', 
                                'Adj. R Squared' = 'adj.r.squared', 
                                'Residual SE' = 'sigma',
                                'P-Value' = 'p.value'))
Model1
(Intercept)5.677 ***
(0.049)   
prop_type_simplifiedEntire rental unit-0.132 ***
(0.034)   
prop_type_simplifiedEntire serviced apartment0.461 ***
(0.053)   
prop_type_simplifiedOther-0.222 ***
(0.037)   
prop_type_simplifiedPrivate room in rental unit-0.844 ***
(0.035)   
number_of_reviews0.001 ***
(0.000)   
review_scores_rating0.036 ***
(0.008)   
#observations7562        
R squared0.369    
Adj. R Squared0.368    
Residual SE0.473    
P-Value0.000    
*** p < 0.001; ** p < 0.01; * p < 0.05.
model1 %>% 
broom::tidy(conf.int=TRUE)
termestimatestd.errorstatisticp.valueconf.lowconf.high
(Intercept)5.68    0.0489 116   0        5.58    5.77    
prop_type_simplifiedEntire rental unit-0.132   0.0342 -3.850.000117 -0.199   -0.0648  
prop_type_simplifiedEntire serviced apartment0.461   0.0531 8.684.75e-18 0.357   0.565   
prop_type_simplifiedOther-0.222   0.037  -6   2.06e-09 -0.294   -0.149   
prop_type_simplifiedPrivate room in rental unit-0.844   0.0345 -24.5 3.33e-127-0.912   -0.777   
number_of_reviews0.0005769.2e-056.264e-10        0.0003960.000757
review_scores_rating0.0362  0.007584.781.8e-06  0.0214  0.0511  
model1 %>% 
   broom::glance()

r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.3690.3680.47373606-5.07e+031.02e+041.02e+041.69e+0375557562
Overall, Model 1 has p-value of 0, which means it is statistically significant. Furthermore, according to R squared, our model explains 36.9% of the variation in log_price_4_nights with an adjusted R squared of 36.8%. Additionally, all the variables used are significant at the 5% level.

Since the variable property_type is categorical with 5 possible options, the model only shows 4 coefficients. The fifth one (Entire Condo), is incorporated in the intercept. This means the coefficients of the other four are in relation to the fifth one. To interpret the coefficients for a log-transformed dependent variable, we are going to take the exponent of the coefficient. In this case, entire rental units are on average 12.4% cheaper than entire condos for 4 nights. Similarly, entire serviced apartments are 58.6% more expensive than entire condos, whereas private rooms in rental units are 57.0% cheaper than entire condos. Finally, other types of properties are on average 19.9% cheaper than entire condos.

Moreover, we see that one unit change in review_scores_rating corresponds to 3.7% higher price per 4 nights.

Finally, every additional review corresponds to 0.06% higher price per 4 nights. This is quite small but we also have to consider that the unit change in number_of_reviews is a single review. The maximum number of reviews on a property is 655 and we can see that if the unit change was 100, for every 100 additional reviews, the price per 4 nights would increase by 5.8%, which looks much more substantial.

# Model 2 - Model 1 + Room Type

model2 <- lm(log_price_4_nights ~  prop_type_simplified + 
                                   number_of_reviews + 
                                   review_scores_rating +
                                   room_type,
             data = listings_train)

# Since room_type is a categorical variable with 4 possible options, the model only shows 3 coefficients. The fourth option (Entire Home/ Apartment) is incorporated in the intercept

# We construct a table with the regression output
huxtable:: huxreg(list("Model2" = model2), 
                  statistics = c('#observations' = 'nobs', 
                                'R squared' = 'r.squared', 
                                'Adj. R Squared' = 'adj.r.squared', 
                                'Residual SE' = 'sigma',
                                'P-Value' = 'p.value'))
Model2
(Intercept)5.713 ***
(0.047)   
prop_type_simplifiedEntire rental unit-0.133 ***
(0.033)   
prop_type_simplifiedEntire serviced apartment0.459 ***
(0.051)   
prop_type_simplifiedOther0.074    
(0.041)   
prop_type_simplifiedPrivate room in rental unit-0.310 ***
(0.048)   
number_of_reviews0.001 ***
(0.000)   
review_scores_rating0.029 ***
(0.007)   
room_typeHotel room0.440 ***
(0.061)   
room_typePrivate room-0.535 ***
(0.034)   
room_typeShared room-0.906 ***
(0.060)   
#observations7562        
R squared0.412    
Adj. R Squared0.411    
Residual SE0.457    
P-Value0.000    
*** p < 0.001; ** p < 0.01; * p < 0.05.
model2 %>% 
broom::tidy(conf.int=TRUE)
termestimatestd.errorstatisticp.valueconf.lowconf.high
(Intercept)5.71    0.0473 121   0       5.62   5.81    
prop_type_simplifiedEntire rental unit-0.133   0.033  -4.016.07e-05-0.197  -0.0678  
prop_type_simplifiedEntire serviced apartment0.459   0.0513 8.954.27e-190.359  0.56    
prop_type_simplifiedOther0.0737  0.0412 1.790.0737  -0.007070.155   
prop_type_simplifiedPrivate room in rental unit-0.31    0.0477 -6.5 8.44e-11-0.404  -0.217   
number_of_reviews0.0005758.9e-056.461.13e-100.0004 0.000749
review_scores_rating0.0287  0.007333.928.96e-050.0144 0.0431  
room_typeHotel room0.44    0.0612 7.2 6.83e-130.32   0.56    
room_typePrivate room-0.535   0.0342 -15.6 2.58e-54-0.602  -0.468   
room_typeShared room-0.906   0.0598 -15.2 3.43e-51-1.02   -0.789   
model2 %>% 
   broom::glance()

r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.4120.4110.45758709-4.8e+039.63e+039.7e+031.58e+0375527562
Model 2 is also statistically significant, with a p-value of 0. Additionally, it explains 41.2% of the variation in log_price_4_nights, which is higher than Model 1. All variables are statistically significant aside from prop_type_simplifiedOther. To compare the two models, however, we need to look at the adjusted r-squared, since Model 2 includes one additional variable. Model 2 has an adjusted r-squared of 41.11% compared to 36.83% for Model 1. We can conclude that Model 2 has a higher predictive power than Model 1.

Since room_type is a categorical variable with 4 possible options, the model only shows 3 coefficients. The fourth option (Entire Home/ Apartment) is incorporated in the intercept. To interpret the coefficients for a log-transformed dependent variable, we are going to take the exponent of the coefficient. This means that hotel rooms, for example, are on average 55.27% more expensive than entire home/apartments. Similarly, private rooms 41.4% cheaper, whereas shared rooms are 59.6% cheaper than entire homes.

Now, before we continue with additional model specifications, we are going to perform a VIF test to check for multicollinearity:

autoplot(model2)

car::vif(model2)
                          GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 11.942272  4        1.363439
number_of_reviews     1.018235  1        1.009076
review_scores_rating  1.009018  1        1.004499
room_type            11.923753  3        1.511479

We can see that room_type and property_type_simplified have VIF scores of more than 10, which means there is potential multicollinearity problem in our model. Before we proceed with the analysis, we will test which variable has a higher predictive power and remove the other:

model2_property <- lm(log_price_4_nights ~  prop_type_simplified + 
                                           number_of_reviews + 
                                           review_scores_rating,
                     data = listings_train)
model2_room <- lm(log_price_4_nights ~  number_of_reviews + 
                                       review_scores_rating +
                                       room_type,
                     data = listings_train)

huxtable:: huxreg(list("Property Type Model" = model2_property, 
                       "Room Type Model" = model2_room), 
                  statistics = c('#observations' = 'nobs', 
                                'R squared' = 'r.squared', 
                                'Adj. R Squared' = 'adj.r.squared', 
                                'Residual SE' = 'sigma')
                  ) %>% 
  set_caption('Comparison of Property Type vs. Room Type Models')
Comparison of Property Type vs. Room Type Models
Property Type ModelRoom Type Model
(Intercept)5.677 ***5.638 ***
(0.049)   (0.036)   
prop_type_simplifiedEntire rental unit-0.132 ***        
(0.034)           
prop_type_simplifiedEntire serviced apartment0.461 ***        
(0.053)           
prop_type_simplifiedOther-0.222 ***        
(0.037)           
prop_type_simplifiedPrivate room in rental unit-0.844 ***        
(0.035)           
number_of_reviews0.001 ***0.001 ***
(0.000)   (0.000)   
review_scores_rating0.036 ***0.024 ** 
(0.008)   (0.008)   
room_typeHotel room        0.608 ***
        (0.058)   
room_typePrivate room        -0.701 ***
        (0.011)   
room_typeShared room        -0.742 ***
        (0.056)   
#observations7562        7562        
R squared0.369    0.371    
Adj. R Squared0.368    0.370    
Residual SE0.473    0.472    
*** p < 0.001; ** p < 0.01; * p < 0.05.
model2 %>% 
broom::tidy(conf.int=TRUE)
termestimatestd.errorstatisticp.valueconf.lowconf.high
(Intercept)5.71    0.0473 121   0       5.62   5.81    
prop_type_simplifiedEntire rental unit-0.133   0.033  -4.016.07e-05-0.197  -0.0678  
prop_type_simplifiedEntire serviced apartment0.459   0.0513 8.954.27e-190.359  0.56    
prop_type_simplifiedOther0.0737  0.0412 1.790.0737  -0.007070.155   
prop_type_simplifiedPrivate room in rental unit-0.31    0.0477 -6.5 8.44e-11-0.404  -0.217   
number_of_reviews0.0005758.9e-056.461.13e-100.0004 0.000749
review_scores_rating0.0287  0.007333.928.96e-050.0144 0.0431  
room_typeHotel room0.44    0.0612 7.2 6.83e-130.32   0.56    
room_typePrivate room-0.535   0.0342 -15.6 2.58e-54-0.602  -0.468   
room_typeShared room-0.906   0.0598 -15.2 3.43e-51-1.02   -0.789   
model2 %>% 
   broom::glance()
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.4120.4110.45758709-4.8e+039.63e+039.7e+031.58e+0375527562

As we can see from the table, the model that includes room types instead of property types has both a higher r-squared and adj. r-squared. To avoid multicollinearity, we will continue using that model and will discard property type.

car::vif(model2_room)
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.014033  1        1.006992
review_scores_rating 1.007116  1        1.003552
room_type            1.010184  3        1.001690

3.2 Further Model Specifications

Our dataset has many more variables, so we attempt to extend our analysis by adding a number of variables, which could prove useful in increasing the explanatory power of our model.

  1. Are the number of bathrooms, bedrooms, beds, or size of the house (accomodates) significant predictors of price_4_nights? Or might these be co-linear variables?
  2. Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?
  3. Some hosts allow you to immediately book their listing (instant_bookable == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is instant_bookable a significant predictor of price_4_nights?
  4. For all cities, there are 3 variables that relate to neighbourhoods: neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed. There are typically more than 20 neighbourhoods in each city, and it wouldn’t make sense to include them all in your model. Use your city knowledge, or ask someone with city knowledge, and see whether you can group neighbourhoods together so the majority of listings falls in fewer (5-6 max) geographical areas. You would thus need to create a new categorical variabale neighbourhood_simplified and determine whether location is a predictor of price_4_nights
  5. What is the effect of avalability_30 or reviews_per_month on price_4_nights, after we control for other variables?
# Model 3 - Adding property-specific characteristics to Model_Rooms. 

model3 <- lm(log_price_4_nights ~ number_of_reviews + 
                                  review_scores_rating +
                                  room_type +
                                  bedrooms +
                                  bathrooms_num+
                                  beds +
                                  accommodates,
                     data = listings_train)

# We construct a table with the regression output
huxtable:: huxreg(list("Model3" = model3), 
                  statistics = c('#observations' = 'nobs', 
                                'R squared' = 'r.squared', 
                                'Adj. R Squared' = 'adj.r.squared', 
                                'Residual SE' = 'sigma',
                                'P-Value' = 'p.value')
                  )
Model3
(Intercept)4.914 ***
(0.041)   
number_of_reviews0.000 ** 
(0.000)   
review_scores_rating0.029 ***
(0.007)   
room_typeHotel room0.717 ***
(0.053)   
room_typePrivate room-0.544 ***
(0.012)   
room_typeShared room-0.891 ***
(0.053)   
bedrooms0.076 ***
(0.013)   
bathrooms_num0.151 ***
(0.018)   
beds0.016 *  
(0.007)   
accommodates0.127 ***
(0.007)   
#observations6903        
R squared0.509    
Adj. R Squared0.509    
Residual SE0.422    
P-Value0.000    
*** p < 0.001; ** p < 0.01; * p < 0.05.
model3 %>% 
broom::tidy(conf.int=TRUE)
termestimatestd.errorstatisticp.valueconf.lowconf.high
(Intercept)4.91    0.0408  120   0       4.83    4.99    
number_of_reviews0.0002558.73e-052.920.00351 8.37e-050.000426
review_scores_rating0.0287  0.00729 3.938.49e-050.0144  0.043   
room_typeHotel room0.717   0.0533  13.4 1.14e-400.612   0.821   
room_typePrivate room-0.544   0.0116  -47.1 0       -0.567   -0.522   
room_typeShared room-0.891   0.0526  -16.9 4.76e-63-0.994   -0.787   
bedrooms0.0763  0.013   5.855.02e-090.0507  0.102   
bathrooms_num0.151   0.0184  8.183.36e-160.114   0.187   
beds0.0163  0.00697 2.340.0191  0.00267 0.03    
accommodates0.127   0.00676 18.8 6.4e-77 0.114   0.14    
model3 %>% 
   broom::glance()
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.5090.5090.42279509-3.84e+037.7e+037.77e+031.23e+0368936903

We see from the table that Model 3 is statistically significant and adding information about property amenities like number of bedrooms and how many people a property can accommodate greatly increases the predictive power of the model. We see an r-squared of 50.9% and an adjusted r-squared of 50.9%. Looking at the statistical significance of the explanatory variables, we find that they are all statistically significant at least a 5% level.

However, we need be weary of potential multicollinearity. Logically, it could expected that bedrooms, beds, and accommodates would be highly correlated. Therefore, we test for multicollinearity by running a VIF-test.

# We test for multicollinearity
car::vif(model3)
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.032847  1        1.016291
review_scores_rating 1.008821  1        1.004401
room_type            1.365954  3        1.053350
bedrooms             1.926766  1        1.388080
bathrooms_num        1.116281  1        1.056542
beds                 2.577324  1        1.605405
accommodates         3.195903  1        1.787709

Although the results don’t suggest there is multicollinearity problem with Model 3, we will attempt to remove accommodates to see the effect:

# Model 4 A - Model 3 without `accommodates`

model4_a <- lm(log_price_4_nights ~ number_of_reviews + 
                                  review_scores_rating +
                                  room_type +
                                  bedrooms +
                                  bathrooms_num+
                                  beds,
                     data = listings_train)

# We construct a table with the regression output
huxtable:: huxreg(list("No-Accommodates Model" = model4_a), 
                  statistics = c('#observations' = 'nobs', 
                                'R squared' = 'r.squared', 
                                'Adj. R Squared' = 'adj.r.squared', 
                                'Residual SE' = 'sigma',
                                'P-Value' = 'p.value')
                  )
No-Accommodates Model
(Intercept)5.057 ***
(0.041)   
number_of_reviews0.000 ***
(0.000)   
review_scores_rating0.027 ***
(0.007)   
room_typeHotel room0.686 ***
(0.055)   
room_typePrivate room-0.602 ***
(0.011)   
room_typeShared room-0.891 ***
(0.054)   
bedrooms0.167 ***
(0.012)   
bathrooms_num0.163 ***
(0.019)   
beds0.096 ***
(0.006)   
#observations6903        
R squared0.484    
Adj. R Squared0.484    
Residual SE0.433    
P-Value0.000    
*** p < 0.001; ** p < 0.01; * p < 0.05.
car::vif(model4_a)
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.028887  1        1.014340
review_scores_rating 1.008604  1        1.004293
room_type            1.269852  3        1.040620
bedrooms             1.662095  1        1.289223
bathrooms_num        1.114865  1        1.055872
beds                 1.631100  1        1.277145

The model and all the descriptive statistics prove statistically significant. The predictive power of our model has decreased, however, from 50.9% adjusted r-squared to 48.4%. We will now try removing beds and running the regression with accommodates instead:

# Model 4 B - Model 3 without `beds`

model4_b <- lm(log_price_4_nights ~ number_of_reviews + 
                                  review_scores_rating +
                                  room_type +
                                  bedrooms +
                                  bathrooms_num+
                                  accommodates,
                     data = listings_train)

# We construct a table with the regression output
huxtable:: huxreg(list("No Beds Model" = model4_b), 
                  statistics = c('#observations' = 'nobs', 
                                'R squared' = 'r.squared', 
                                'Adj. R Squared' = 'adj.r.squared', 
                                'Residual SE' = 'sigma',
                                'P-Value' = 'p.value')
                  )
No Beds Model
(Intercept)4.907 ***
(0.041)   
number_of_reviews0.000 ** 
(0.000)   
review_scores_rating0.028 ***
(0.007)   
room_typeHotel room0.719 ***
(0.053)   
room_typePrivate room-0.543 ***
(0.012)   
room_typeShared room-0.853 ***
(0.051)   
bedrooms0.081 ***
(0.013)   
bathrooms_num0.154 ***
(0.018)   
accommodates0.136 ***
(0.005)   
#observations6947        
R squared0.510    
Adj. R Squared0.509    
Residual SE0.422    
P-Value0.000    
*** p < 0.001; ** p < 0.01; * p < 0.05.
car::vif(model4_b)
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.030454  1        1.015113
review_scores_rating 1.008318  1        1.004150
room_type            1.307427  3        1.045690
bedrooms             1.892493  1        1.375679
bathrooms_num        1.117572  1        1.057153
accommodates         2.029664  1        1.424663

The predictive power of Model 4_b is higher than Model 4_a and all variables are statistically significant at the 5% level. Comparing with model 3 adjusted R-squared, we see that we arrive at a similar explanatory power of 50.9 % with a reduced number of explanatory variables. Furthermore, we see that the VIF-analysis returns a lower level of multicollinearity. Thus, we move forward with model 4_b, renaming it to model4.

# Model 5 - Model 4 + host_is_superhost


# Renaming Model 4 B to Model 4
model4 <- model4_b

model5 <- lm(log_price_4_nights ~ number_of_reviews + 
                                  review_scores_rating +
                                  room_type +
                                  bedrooms +
                                  bathrooms_num+
                                  accommodates +
                                  host_is_superhost,
                     data = listings_train)

# We construct a table with the regression output
huxtable:: huxreg(list("Model5" = model5), 
                  statistics = c('#observations' = 'nobs', 
                                'R squared' = 'r.squared', 
                                'Adj. R Squared' = 'adj.r.squared', 
                                'Residual SE' = 'sigma',
                                'P-Value' = 'p.value')
                  )
Model5
(Intercept)4.926 ***
(0.041)   
number_of_reviews0.000    
(0.000)   
review_scores_rating0.023 ** 
(0.007)   
room_typeHotel room0.712 ***
(0.053)   
room_typePrivate room-0.544 ***
(0.011)   
room_typeShared room-0.834 ***
(0.051)   
bedrooms0.083 ***
(0.013)   
bathrooms_num0.150 ***
(0.018)   
accommodates0.135 ***
(0.005)   
host_is_superhostTRUE0.080 ***
(0.014)   
#observations6942        
R squared0.512    
Adj. R Squared0.512    
Residual SE0.421    
P-Value0.000    
*** p < 0.001; ** p < 0.01; * p < 0.05.
car::vif(model5)
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.200211  1        1.095542
review_scores_rating 1.022280  1        1.011079
room_type            1.314144  3        1.046583
bedrooms             1.893906  1        1.376193
bathrooms_num        1.118330  1        1.057511
accommodates         2.033802  1        1.426114
host_is_superhost    1.200927  1        1.095868

Model 5 is statistically significant, with a p-value of 0. Additionally, Model 5 has a slightly better predictive power than Model 4, although the difference in adj. r-squared is very small (0.3%). The new variable we added, host_is_superhost, is statistically significant, but now the number_of_reviews is not. One potential explanation is that the the number of reviews has different effect on the price depending on whether the host is a superhost or not. To deal with this, we will create an interaction term between host_is_superhost and number_of_reviews:

model5_inter <- lm(log_price_4_nights ~ number_of_reviews + 
                                  review_scores_rating +
                                  room_type +
                                  bedrooms +
                                  bathrooms_num+
                                  accommodates+
                                  host_is_superhost+
                                  host_is_superhost*number_of_reviews,
                     data = listings_train)

# We construct a table with the regression output
huxtable:: huxreg(list("Superhost Model + Interaction" = model5_inter), 
                  statistics = c('#observations' = 'nobs', 
                                'R squared' = 'r.squared', 
                                'Adj. R Squared' = 'adj.r.squared', 
                                'Residual SE' = 'sigma',
                                'P-Value' = 'p.value')
                  )
Superhost Model + Interaction
(Intercept)4.928 ***
(0.041)   
number_of_reviews0.000 ** 
(0.000)   
review_scores_rating0.022 ** 
(0.007)   
room_typeHotel room0.707 ***
(0.053)   
room_typePrivate room-0.544 ***
(0.011)   
room_typeShared room-0.840 ***
(0.051)   
bedrooms0.082 ***
(0.013)   
bathrooms_num0.149 ***
(0.018)   
accommodates0.135 ***
(0.005)   
host_is_superhostTRUE0.111 ***
(0.017)   
number_of_reviews:host_is_superhostTRUE-0.001 ***
(0.000)   
#observations6942        
R squared0.513    
Adj. R Squared0.512    
Residual SE0.421    
P-Value0.000    
*** p < 0.001; ** p < 0.01; * p < 0.05.
car::vif(model5_inter)
                                        GVIF Df GVIF^(1/(2*Df))
number_of_reviews                   2.310959  1        1.520184
review_scores_rating                1.023361  1        1.011613
room_type                           1.317293  3        1.047001
bedrooms                            1.894883  1        1.376547
bathrooms_num                       1.119087  1        1.057869
accommodates                        2.035820  1        1.426822
host_is_superhost                   1.775598  1        1.332516
number_of_reviews:host_is_superhost 3.271314  1        1.808677
model5_inter %>% 
broom::tidy(conf.int=TRUE)
termestimatestd.errorstatisticp.valueconf.lowconf.high
(Intercept)4.93    0.0406  121   0        4.85    5.01    
number_of_reviews0.0003570.00013 2.740.0061   0.0001020.000611
review_scores_rating0.0223  0.0073  3.050.00227  0.00798 0.0366  
room_typeHotel room0.707   0.0532  13.3 7.62e-40 0.603   0.811   
room_typePrivate room-0.544   0.0115  -47.3 0        -0.566   -0.521   
room_typeShared room-0.84    0.0511  -16.4 1.15e-59 -0.94    -0.74    
bedrooms0.0821  0.0129  6.381.83e-10 0.0569  0.107   
bathrooms_num0.149   0.0183  8.144.56e-16 0.113   0.185   
accommodates0.135   0.00536 25.1 2.52e-1330.124   0.145   
host_is_superhostTRUE0.111   0.0169  6.614.07e-11 0.0784  0.144   
number_of_reviews:host_is_superhostTRUE-0.0006160.000187-3.3 0.000961 -0.000982-0.000251

First, we see that including the interaction term made number_of_reviews statistically significant again but the new model has approximately the same predictive power as the old one (51.2% adj. R squared). The way to interpret the new coefficient is as follows: when the host is not superhost, each additional review corresponds to approximately 0.036% higher price. When the host is a superhost, each additional review corresponds to 0.026% lower price. Additionally, being a superhost corresponds to 11.2% higher price when controlling for the number of reviews. Thus, all variables are statistically significant and the predictive power of our model is the highest we have achieved so far.

Next, we would like to include the variable neighbourhood_simplified with the following code:

# Renaming Model 5 Interaction to Model 5
model5 <- model5_inter

model6 <- lm(log_price_4_nights ~ number_of_reviews + 
                                  review_scores_rating +
                                  room_type +
                                  bedrooms +
                                  bathrooms_num+
                                  accommodates +
                                  host_is_superhost +
                                  neighbourhood_simplified,
                     data = listings_train)

# We construct a table with the regression output
huxtable:: huxreg(list("Model6" = model6), 
                  statistics = c('#observations' = 'nobs', 
                                'R squared' = 'r.squared', 
                                'Adj. R Squared' = 'adj.r.squared', 
                                'Residual SE' = 'sigma',
                                'P-Value' = 'p.value')
                  )
Model6
(Intercept)4.951 ***
(0.044)   
number_of_reviews-0.000    
(0.000)   
review_scores_rating0.024 ***
(0.007)   
room_typeHotel room0.700 ***
(0.053)   
room_typePrivate room-0.534 ***
(0.011)   
room_typeShared room-0.868 ***
(0.050)   
bedrooms0.088 ***
(0.013)   
bathrooms_num0.137 ***
(0.018)   
accommodates0.134 ***
(0.005)   
host_is_superhostTRUE0.088 ***
(0.014)   
neighbourhood_simplifiedFriedrichshain-Kreuzberg0.018    
(0.021)   
neighbourhood_simplifiedMitte0.063 ** 
(0.021)   
neighbourhood_simplifiedNeukölln-0.125 ***
(0.023)   
neighbourhood_simplifiedOther-0.152 ***
(0.024)   
neighbourhood_simplifiedPankow0.003    
(0.022)   
neighbourhood_simplifiedTempelhof - Schöneberg-0.060 *  
(0.027)   
#observations6942        
R squared0.527    
Adj. R Squared0.526    
Residual SE0.415    
P-Value0.000    
*** p < 0.001; ** p < 0.01; * p < 0.05.
car::vif(model6)
                             GVIF Df GVIF^(1/(2*Df))
number_of_reviews        1.209538  1        1.099790
review_scores_rating     1.023609  1        1.011736
room_type                1.338472  3        1.049788
bedrooms                 1.899468  1        1.378212
bathrooms_num            1.125743  1        1.061010
accommodates             2.042343  1        1.429106
host_is_superhost        1.208290  1        1.099222
neighbourhood_simplified 1.054832  6        1.004458
model6 %>% 
broom::tidy(conf.int=TRUE)
termestimatestd.errorstatisticp.valueconf.lowconf.high
(Intercept)4.95    0.0437  113    0        4.86    5.04    
number_of_reviews-5.31e-059.27e-05-0.5730.567    -0.0002350.000129
review_scores_rating0.0239  0.00719 3.32 0.000908 0.00978 0.038   
room_typeHotel room0.7     0.0526  13.3  5.9e-40  0.597   0.803   
room_typePrivate room-0.534   0.0114  -47    0        -0.556   -0.512   
room_typeShared room-0.868   0.0504  -17.2  4.01e-65 -0.967   -0.769   
bedrooms0.0884  0.0127  6.97 3.53e-12 0.0635  0.113   
bathrooms_num0.137   0.0181  7.6  3.27e-14 0.102   0.173   
accommodates0.134   0.00529 25.4  7.95e-1360.124   0.145   
host_is_superhostTRUE0.0878  0.0137  6.41 1.59e-10 0.0609  0.115   
neighbourhood_simplifiedFriedrichshain-Kreuzberg0.018   0.0208  0.8610.389    -0.0229  0.0588  
neighbourhood_simplifiedMitte0.063   0.0212  2.97 0.00298  0.0214  0.105   
neighbourhood_simplifiedNeukölln-0.125   0.0226  -5.51 3.69e-08 -0.169   -0.0804  
neighbourhood_simplifiedOther-0.152   0.0235  -6.46 1.11e-10 -0.198   -0.106   
neighbourhood_simplifiedPankow0.00261 0.0219  0.1190.905    -0.0404  0.0456  
neighbourhood_simplifiedTempelhof - Schöneberg-0.0598  0.0272  -2.2  0.0281   -0.113   -0.00643 

Model 6 is statistically significant and has the highest adjusted r-squared so far at 52.6%. Additionally, 52.7% of the variation in log_price_4_nights can be explained by Model 6. The variable neighbourhood_simplified has 6 possible values: Friedrichshain-Kreuzberg, Mitte, Pankow, Neukölln, Charlottenburg-Wilm, Tempelhof - Schöneberg, and Other. Since neighbourhood_simplified is a categorical variable, only 5 possible outcomes have their own coefficient, and the last one,Charlottenburg-Wilm , is incorporated into the intercept. Although not all neighbourhoods have a statistically significant effect on log price, properties in Mitte are on average 6.5% more expensive, where as properties in Neukolln, Pankow, and Tempelhof are 11.7%, 0.3%, and 5.8% cheaper than Charlottenburg-Wilm respectively.

Since some neighbourhoods have a significant effect on log price and this model has the highest predictive power so far, we will proceed with Model 6.

Next, we would like to add availability_30 or reviews_per_month to our analysis with the following code:

model7 <- lm(log_price_4_nights ~ number_of_reviews + 
                                  review_scores_rating +
                                  room_type +
                                  bedrooms +
                                  bathrooms_num +
                                  accommodates+
                                  host_is_superhost+
                                  host_is_superhost*number_of_reviews+
                                  neighbourhood_simplified+
                                  availability_30+
                                  reviews_per_month,
                     data = listings_train)

# We construct a table with the regression output
huxtable:: huxreg(list("Model 7" = model7), 
                  statistics = c('#observations' = 'nobs', 
                                'R squared' = 'r.squared', 
                                'Adj. R Squared' = 'adj.r.squared', 
                                'Residual SE' = 'sigma',
                                'P-Value' = 'p.value')
                  )
Model 7
(Intercept)4.821 ***
(0.041)   
number_of_reviews-0.000 ** 
(0.000)   
review_scores_rating0.038 ***
(0.007)   
room_typeHotel room0.416 ***
(0.050)   
room_typePrivate room-0.542 ***
(0.011)   
room_typeShared room-1.085 ***
(0.047)   
bedrooms0.108 ***
(0.012)   
bathrooms_num0.132 ***
(0.017)   
accommodates0.112 ***
(0.005)   
host_is_superhostTRUE0.094 ***
(0.016)   
neighbourhood_simplifiedFriedrichshain-Kreuzberg0.076 ***
(0.019)   
neighbourhood_simplifiedMitte0.091 ***
(0.020)   
neighbourhood_simplifiedNeukölln-0.060 ** 
(0.021)   
neighbourhood_simplifiedOther-0.156 ***
(0.022)   
neighbourhood_simplifiedPankow0.050 *  
(0.020)   
neighbourhood_simplifiedTempelhof - Schöneberg-0.032    
(0.025)   
availability_300.023 ***
(0.001)   
reviews_per_month0.010 *  
(0.004)   
number_of_reviews:host_is_superhostTRUE-0.000    
(0.000)   
#observations6942        
R squared0.593    
Adj. R Squared0.592    
Residual SE0.385    
P-Value0.000    
*** p < 0.001; ** p < 0.01; * p < 0.05.
car::vif(model7)
                                        GVIF Df GVIF^(1/(2*Df))
number_of_reviews                   2.866633  1        1.693113
review_scores_rating                1.032123  1        1.015935
room_type                           1.416314  3        1.059725
bedrooms                            1.907454  1        1.381106
bathrooms_num                       1.126554  1        1.061393
accommodates                        2.078562  1        1.441722
host_is_superhost                   1.799955  1        1.341624
neighbourhood_simplified            1.082130  6        1.006599
availability_30                     1.180472  1        1.086495
reviews_per_month                   1.628604  1        1.276168
number_of_reviews:host_is_superhost 3.299141  1        1.816354
model7 %>% 
broom::tidy(conf.int=TRUE)
termestimatestd.errorstatisticp.valueconf.lowconf.high
(Intercept)4.82    0.0407  118   0        4.74    4.9     
number_of_reviews-0.0004020.000132-3.040.00241  -0.000662-0.000142
review_scores_rating0.0382  0.0067  5.7 1.28e-08 0.025   0.0513  
room_typeHotel room0.416   0.0495  8.4 5.22e-17 0.319   0.513   
room_typePrivate room-0.542   0.0106  -51.3 0        -0.563   -0.522   
room_typeShared room-1.09    0.0473  -22.9 3.21e-112-1.18    -0.992   
bedrooms0.108   0.0118  9.129.43e-20 0.0845  0.131   
bathrooms_num0.132   0.0168  7.893.55e-15 0.0994  0.165   
accommodates0.112   0.00495 22.7 2.41e-1100.103   0.122   
host_is_superhostTRUE0.0945  0.0155  6.091.2e-09  0.0641  0.125   
neighbourhood_simplifiedFriedrichshain-Kreuzberg0.0762  0.0194  3.928.81e-05 0.0381  0.114   
neighbourhood_simplifiedMitte0.0908  0.0197  4.614.14e-06 0.0521  0.129   
neighbourhood_simplifiedNeukölln-0.0599  0.0211  -2.840.00457  -0.101   -0.0185  
neighbourhood_simplifiedOther-0.156   0.0218  -7.131.1e-12  -0.198   -0.113   
neighbourhood_simplifiedPankow0.05    0.0204  2.450.0143   0.00997 0.0899  
neighbourhood_simplifiedTempelhof - Schöneberg-0.0316  0.0253  -1.250.212    -0.0811  0.018   
availability_300.0225  0.0007  32.1 2.13e-2110.0211  0.0239  
reviews_per_month0.00982 0.00403 2.440.0148   0.00192 0.0177  
number_of_reviews:host_is_superhostTRUE-0.0002130.000171-1.240.215    -0.0005480.000123

Model 7 is statistically significant and has the highest predictive power so far at 59.2% adjusted r-squared. Additionally, 59.3% of the variation in log_price_4_nights is explained by the model. Both availability_30 and reviews_per_month are statistically significant. Look at the VIF test, we see that there is no problem with multicollinearity, which would suggest we should leave both variables in. Looking at number_of_reviews, we see that while the variable’s coefficient was insignificant in Model 6, it is significant now. In other words, controlling for the reviews per month a property receives makes the number of reviews a property has significant. This means we should include reviews_per_month in our model, despite it being not statistically significant.

Following this, we would like to increase the predictive power of our model as much as possible. To do that we will start testing the predictive power of the following variables one by one:

  • host_response_rate
  • host_listings_count
  • maximum_nights
  • number_of_reviews_ltm
  • review_scores_accuracy
  • review_scores_cleanliness
  • review_scores_checkin
  • review_scores_communication
  • review_scores_location
  • review_scores_value
  • longitude
  • latitude
  • number_of_reviews_l30
model8 <- lm(log_price_4_nights ~ number_of_reviews + 
                                  review_scores_rating +
                                  room_type +
                                  bedrooms +
                                  bathrooms_num+
                                  accommodates +
                                  host_is_superhost +
                                  neighbourhood_simplified +
                                  availability_30 +
                                  reviews_per_month +
                                  review_scores_value +
                                  review_scores_cleanliness +
                                  review_scores_location +
                                  review_scores_checkin +
                                  review_scores_accuracy +
                                  latitude +
                                  longitude +
                                  number_of_reviews_l30d +
                                  calculated_host_listings_count,
                     data = listings_train)

# We construct a table with the regression output
huxtable:: huxreg(list("Model8" = model8), 
                  statistics = c('#observations' = 'nobs', 
                                'R squared' = 'r.squared', 
                                'Adj. R Squared' = 'adj.r.squared', 
                                'Residual SE' = 'sigma',
                                'P-Value' = 'p.value')
                  )
Model8
(Intercept)71.062 ***
(11.424)   
number_of_reviews-0.001 ***
(0.000)   
review_scores_rating0.175 ***
(0.022)   
room_typeHotel room0.370 ***
(0.048)   
room_typePrivate room-0.526 ***
(0.010)   
room_typeShared room-1.149 ***
(0.050)   
bedrooms0.112 ***
(0.012)   
bathrooms_num0.124 ***
(0.016)   
accommodates0.108 ***
(0.005)   
host_is_superhostTRUE0.066 ***
(0.012)   
neighbourhood_simplifiedFriedrichshain-Kreuzberg0.101 ***
(0.023)   
neighbourhood_simplifiedMitte0.138 ***
(0.022)   
neighbourhood_simplifiedNeukölln-0.050 *  
(0.024)   
neighbourhood_simplifiedOther-0.100 ***
(0.025)   
neighbourhood_simplifiedPankow0.130 ***
(0.025)   
neighbourhood_simplifiedTempelhof - Schöneberg-0.046    
(0.026)   
availability_300.020 ***
(0.001)   
reviews_per_month0.024 ***
(0.005)   
review_scores_value-0.247 ***
(0.019)   
review_scores_cleanliness0.123 ***
(0.013)   
review_scores_location0.181 ***
(0.016)   
review_scores_checkin-0.034 *  
(0.017)   
review_scores_accuracy-0.055 ** 
(0.020)   
latitude-1.195 ***
(0.212)   
longitude-0.298 ** 
(0.095)   
number_of_reviews_l30d-0.019 ***
(0.004)   
calculated_host_listings_count0.006 ***
(0.001)   
#observations6823        
R squared0.620    
Adj. R Squared0.618    
Residual SE0.372    
P-Value0.000    
*** p < 0.001; ** p < 0.01; * p < 0.05.
car::vif(model8)
                                   GVIF Df GVIF^(1/(2*Df))
number_of_reviews              1.698150  1        1.303131
review_scores_rating           3.330309  1        1.824914
room_type                      1.685812  3        1.090942
bedrooms                       1.938326  1        1.392238
bathrooms_num                  1.131639  1        1.063785
accommodates                   2.153714  1        1.467554
host_is_superhost              1.231920  1        1.109919
neighbourhood_simplified       4.192234  6        1.126861
availability_30                1.320644  1        1.149193
reviews_per_month              2.637634  1        1.624079
review_scores_value            2.984940  1        1.727698
review_scores_cleanliness      1.988977  1        1.410311
review_scores_location         1.579065  1        1.256609
review_scores_checkin          1.727042  1        1.314170
review_scores_accuracy         2.624735  1        1.620104
latitude                       2.261321  1        1.503769
longitude                      1.733604  1        1.316664
number_of_reviews_l30d         2.107147  1        1.451601
calculated_host_listings_count 1.434748  1        1.197810

This model gives us the highest adjusted R-squared so far (61.8%) and explains 62.0% of the variation in the log price. In addition, all the coefficients except for neighbourhood_simplifiedTempelhof - Schöneberg are statistically significant at the 5% level. However, we will not remove this as this forms part of a categorical variable whose other coefficients are significant.

Despite this, the addition of all these new variables has made the model complicated. In our next model, we observe that when retaining significant and relevant variables, with only 7 variables, we are able to reduce number of explanatory variables by 6 and only drop explanatory power by 4.4 %, which leads to a less complicated and superior model from a complexity vs. explanatory power-tradeoff while avoiding overfitting the model to the training data.

model9 <- lm(log_price_4_nights ~ 
                                  room_type +
                                  accommodates +
                                  bathrooms_num+
                                  host_is_superhost +
                                  availability_30 +
                                  review_scores_value +
                                  review_scores_cleanliness +
                                  review_scores_location,
                     data = listings_train)

# We construct a table with the regression output
huxtable:: huxreg(list("Model9" = model9), 
                  statistics = c('#observations' = 'nobs', 
                                'R squared' = 'r.squared', 
                                'Adj. R Squared' = 'adj.r.squared', 
                                'Residual SE' = 'sigma',
                                'P-Value' = 'p.value')
                  )
Model9
(Intercept)4.321 ***
(0.069)   
room_typeHotel room0.397 ***
(0.049)   
room_typePrivate room-0.536 ***
(0.010)   
room_typeShared room-1.067 ***
(0.047)   
accommodates0.135 ***
(0.004)   
bathrooms_num0.181 ***
(0.016)   
host_is_superhostTRUE0.041 ***
(0.011)   
availability_300.021 ***
(0.001)   
review_scores_value-0.222 ***
(0.015)   
review_scores_cleanliness0.155 ***
(0.012)   
review_scores_location0.217 ***
(0.015)   
#observations7396        
R squared0.582    
Adj. R Squared0.582    
Residual SE0.385    
P-Value0.000    
*** p < 0.001; ** p < 0.01; * p < 0.05.
car::vif(model9)
                              GVIF Df GVIF^(1/(2*Df))
room_type                 1.320472  3        1.047422
accommodates              1.318854  1        1.148413
bathrooms_num             1.064827  1        1.031905
host_is_superhost         1.046557  1        1.023014
availability_30           1.116529  1        1.056659
review_scores_value       1.952214  1        1.397216
review_scores_cleanliness 1.646925  1        1.283326
review_scores_location    1.390104  1        1.179027

We believe that Model 9, our latest model, strikes a good balance between simplicity and explanatory power and the variables in the model reflect our intuition about predictors of price.

Now, we check the residuals of our models.

autoplot(model1)

autoplot(model2)

autoplot(model3)

autoplot(model4)

autoplot(model5)

autoplot(model6)

autoplot(model7)

autoplot(model8)

autoplot(model9)

huxtable:: huxreg(list(model1,model2, model3, model4, model5, model6, model7, model8, model9), 
                  statistics = c('#observations' = 'nobs', 
                                'R squared' = 'r.squared', 
                                'Adj. R Squared' = 'adj.r.squared', 
                                'Residual SE' = 'sigma',
                                'P-Value' = 'p.value')
                  )

(1)(2)(3)(4)(5)(6)(7)(8)(9)
(Intercept)5.677 ***5.713 ***4.914 ***4.907 ***4.928 ***4.951 ***4.821 ***71.062 ***4.321 ***
(0.049)   (0.047)   (0.041)   (0.041)   (0.041)   (0.044)   (0.041)   (11.424)   (0.069)   
prop_type_simplifiedEntire rental unit-0.132 ***-0.133 ***                                                        
(0.034)   (0.033)                                                           
prop_type_simplifiedEntire serviced apartment0.461 ***0.459 ***                                                        
(0.053)   (0.051)                                                           
prop_type_simplifiedOther-0.222 ***0.074                                                            
(0.037)   (0.041)                                                           
prop_type_simplifiedPrivate room in rental unit-0.844 ***-0.310 ***                                                        
(0.035)   (0.048)                                                           
number_of_reviews0.001 ***0.001 ***0.000 ** 0.000 ** 0.000 ** -0.000    -0.000 ** -0.001 ***        
(0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)           
review_scores_rating0.036 ***0.029 ***0.029 ***0.028 ***0.022 ** 0.024 ***0.038 ***0.175 ***        
(0.008)   (0.007)   (0.007)   (0.007)   (0.007)   (0.007)   (0.007)   (0.022)           
room_typeHotel room        0.440 ***0.717 ***0.719 ***0.707 ***0.700 ***0.416 ***0.370 ***0.397 ***
        (0.061)   (0.053)   (0.053)   (0.053)   (0.053)   (0.050)   (0.048)   (0.049)   
room_typePrivate room        -0.535 ***-0.544 ***-0.543 ***-0.544 ***-0.534 ***-0.542 ***-0.526 ***-0.536 ***
        (0.034)   (0.012)   (0.012)   (0.011)   (0.011)   (0.011)   (0.010)   (0.010)   
room_typeShared room        -0.906 ***-0.891 ***-0.853 ***-0.840 ***-0.868 ***-1.085 ***-1.149 ***-1.067 ***
        (0.060)   (0.053)   (0.051)   (0.051)   (0.050)   (0.047)   (0.050)   (0.047)   
bedrooms                0.076 ***0.081 ***0.082 ***0.088 ***0.108 ***0.112 ***        
                (0.013)   (0.013)   (0.013)   (0.013)   (0.012)   (0.012)           
bathrooms_num                0.151 ***0.154 ***0.149 ***0.137 ***0.132 ***0.124 ***0.181 ***
                (0.018)   (0.018)   (0.018)   (0.018)   (0.017)   (0.016)   (0.016)   
beds                0.016 *                                                  
                (0.007)                                                   
accommodates                0.127 ***0.136 ***0.135 ***0.134 ***0.112 ***0.108 ***0.135 ***
                (0.007)   (0.005)   (0.005)   (0.005)   (0.005)   (0.005)   (0.004)   
host_is_superhostTRUE                                0.111 ***0.088 ***0.094 ***0.066 ***0.041 ***
                                (0.017)   (0.014)   (0.016)   (0.012)   (0.011)   
number_of_reviews:host_is_superhostTRUE                                -0.001 ***        -0.000                    
                                (0.000)           (0.000)                   
neighbourhood_simplifiedFriedrichshain-Kreuzberg                                        0.018    0.076 ***0.101 ***        
                                        (0.021)   (0.019)   (0.023)           
neighbourhood_simplifiedMitte                                        0.063 ** 0.091 ***0.138 ***        
                                        (0.021)   (0.020)   (0.022)           
neighbourhood_simplifiedNeukölln                                        -0.125 ***-0.060 ** -0.050 *          
                                        (0.023)   (0.021)   (0.024)           
neighbourhood_simplifiedOther                                        -0.152 ***-0.156 ***-0.100 ***        
                                        (0.024)   (0.022)   (0.025)           
neighbourhood_simplifiedPankow                                        0.003    0.050 *  0.130 ***        
                                        (0.022)   (0.020)   (0.025)           
neighbourhood_simplifiedTempelhof - Schöneberg                                        -0.060 *  -0.032    -0.046            
                                        (0.027)   (0.025)   (0.026)           
availability_30                                                0.023 ***0.020 ***0.021 ***
                                                (0.001)   (0.001)   (0.001)   
reviews_per_month                                                0.010 *  0.024 ***        
                                                (0.004)   (0.005)           
review_scores_value                                                        -0.247 ***-0.222 ***
                                                        (0.019)   (0.015)   
review_scores_cleanliness                                                        0.123 ***0.155 ***
                                                        (0.013)   (0.012)   
review_scores_location                                                        0.181 ***0.217 ***
                                                        (0.016)   (0.015)   
review_scores_checkin                                                        -0.034 *          
                                                        (0.017)           
review_scores_accuracy                                                        -0.055 **         
                                                        (0.020)           
latitude                                                        -1.195 ***        
                                                        (0.212)           
longitude                                                        -0.298 **         
                                                        (0.095)           
number_of_reviews_l30d                                                        -0.019 ***        
                                                        (0.004)           
calculated_host_listings_count                                                        0.006 ***        
                                                        (0.001)           
#observations7562        7562        6903        6947        6942        6942        6942        6823        7396        
R squared0.369    0.412    0.509    0.510    0.513    0.527    0.593    0.620    0.582    
Adj. R Squared0.368    0.411    0.509    0.509    0.512    0.526    0.592    0.618    0.582    
Residual SE0.473    0.457    0.422    0.422    0.421    0.415    0.385    0.372    0.385    
P-Value0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000    
*** p < 0.001; ** p < 0.01; * p < 0.05.
Looking at Model 9, plot of the residuals against fitted values has a mean hovering around 0. They don’t appear to be correlated with the fitted values. Furthermore, the Q-Q plot shows the residuals are approximately normally distributed (with perhaps a slight skew indicated by the deviation from the diagonal line). In addition, the variability is approximately constant indicated by the scale-location graph.

3.3 Prediction and analysis

We now take Model 9 RMSE on the out-of-sample data. We also calculate the RMSE on the in-sample data as a benchmark.

rmse_train<-listings_train %>% 
  mutate(predictions = predict(model9,.)) %>% 
  select(predictions,log_price_4_nights) %>% 
  mutate(squared_error = (predictions - log_price_4_nights)^2) %>%
  summarise(rmse = sqrt(mean(squared_error, na.rm=TRUE))) %>% 
  pull

rmse_train
[1] 0.384718
rmse_test <- listings_test %>% 
  mutate(predictions = predict(model9,.)) %>% 
  select(predictions,log_price_4_nights) %>% 
  mutate(squared_error = (predictions - log_price_4_nights)^2) %>% 
  summarise(rmse = sqrt(mean(squared_error, na.rm=TRUE))) %>% 
  pull()

rmse_test
[1] 0.3792494

The RMSEs are similar for the in-sample and out-of-sample data, so we can see that Model 9 does not over fit.

Finally, we calculate the predictions of price using Model 9 for apartments with a private room, have at least 10 reviews, and which have an average rating of at least 4.5

subset<-listings_test %>% 
  filter(
    room_type=="Private room",
    number_of_reviews>=10,
    review_scores_rating>=4.5
    )

predicted_logprices<-model9 %>% 
                    predict(subset) %>% 
                    as.data.frame()

favstats(~., data=predicted_logprices) %>% 
  mutate(lower=mean-1.96*sd/sqrt(n),
         upper=mean+1.96*sd/sqrt(n),
         lower_nominal=exp(lower), #Here we convert from log prices to prices.
         upper_nominal=exp(upper),
         mean_nominal=exp(mean),
         min_nominal=exp(min),
         max_nominal=exp(max)) %>% 
  select(min_nominal, 
         mean_nominal, 
         max_nominal, 
         lower_nominal,
         upper_nominal
         )

min_nominalmean_nominalmax_nominallower_nominalupper_nominal
117163414159166
Based on our predictions, the price for 4 nights is expected to be approximately 163 dollars, with a 95% confidence interval ranging from 158.79-166.34.

4 Acknowledgements